Problem set 5¶

The objective of this problem set is to put together everything you have learned this semester. For this reason, the problem set has some parts that are a bit open ended. In these parts, you will need to use your judgment as to what is the best way to do something (make sure you make explicit your reasoning!). In other parts, the questions will be precise (and hopefully your answer will be so as well).

In this problem set, you will work with data collected from Airbnb by a third party. The objective of the problem set, ultimately, is to learn about Airbnb listings, whether Airbnb hosts are operate as single- or multi-home managers, identify locations in which Airbnb listings are concentrated, which listings compete with each other, etc.

This is a relatively long problem set. For this reason, please do not wait until the last moment to work on it. Doing so will ensure that you won't be able to finish. Because the semester is ending, I will not be able to provide extensions unless these follow University rules.

0. Workflow and preliminaries¶

Organize your workflow. Create a directory dedicated to this problem set. Inside this directory, create directories called RawData, ProcessedData, Code, and Figures. You will store the data you get from the Internet in the RawData directory. Intermediate and final datasets created by you will go to ProcessedData. The code you write will be saved to Code and the figures you produced will be saved to Figures. However, you will turn in your Jupyter Notebook so make sure that all the output is both displayed on the notebook and saved to the relevant directories.

1. Learning about the data¶

Go to http://insideairbnb.com/:

  • Visit the entire website. Make sure you check every page in the website, read about the motivation that led to the creation of the website, the policies related to data access and management, etc.
  • Go to Data and to Get the data. You will see that there are many datasets from many different cities.
  • Choose an interesting city. This means that the city must be big enough so there are many listings, and it must have interesting locations for tourists (which we'll use later).
  • Scroll down and download the file listings.csv.gz for the city that you chose. Download this file to your computer and save it in RawData. Extract to the same directory the file that is inside the compressed file.
  • Create a Jupyter Notebool called problem_set_5_lastname_firstname.ipynb and save it to Code.

2. Examining the data¶

I'm going to be vague here but there are several things you should do. Doing these, however, is the bare minimum and will ensure you get 60 percent of the points of this section. You should think carefully about other statistics that may be of interest.

  • Import modules that we normaly used during the course. Examples of these include Pandas, Numpy, Folium, Seaborn, etc.
  • Load the data to your notebook.
  • Recognize which variables are numeric, which ones are strings, which ones are such that could be considered boolean, etc. You don't need to tell me which variables are what, but you do need to make sure that when we work with some of these variables, these have the correct format.
  • We will restrict the analysis to the following variables: id, name,host_id,host_since, host_location,host_acceptance_rate,host_is_superhost, host_neighbourhood, neighbourhood_cleansed,latitude,longitude, property_type, room_type, accommodates, bathrooms,bedrooms, beds, price,minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value. Therefore, make sure you drop all other columns. If these covariates do not exist for the city you chose, choose a different city! If you find a covariate that I didn't add to the list and that you consider relevant for the analysis, add it and make sure you tell me about it and why you want to include it.
  • How many listings are there?
  • How many hosts?
  • What is the average number of reviews per listing? The average rating per listing?
  • Provide a table with summary statistics (e.g., mean, median, standard deviation, minimum, maximum) for all the variables for which this analysis is relevant. If you learn that some variables are not useful, drop them from the dataframe.
  • Produce the same correlation matrix plot we did in the ML module for the numeric variables that correspond to characteristics of the listings, but exclude review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value, minimum_nights, maximum_nights.

2.1. Import modules and read the data¶

In [2]:
# In this cell, import all the relevant modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from shapely.geometry import Polygon, Point, LineString
import folium
import requests
import geopandas as gpd
In [3]:
#desktop path 
rel_path = r'C:\\Users\\Hunter\\Documents\\econ471\\problemset5\\'
in_path = r'C:\\Users\\Hunter\\Documents\\econ471\\problemset5\\rawdata'
in_path = rel_path + r'\\rawdata\\'
out_path = rel_path + r'\\processeddata\\'
out_pathfig = rel_path + r'\\figures\\'
In [483]:
# In this cell, define the relevant paths
rel_path = r'C:\\Users\\Keesu\\Documents\\econ471\\problemset5\\'
in_path = r'C:\\Users\Hunter\Documents\econ471\problemset5\rawdata'
in_path = rel_path + r'\\rawdata\\'
out_path = rel_path + r'\\processeddata\\'
out_pathfig = rel_path + r'\\figures\\'
In [4]:
# In this cell, define the name of the file you will import
fname = 'listings.csv'
df = pd.read_csv(in_path + fname, keep_default_na=True)
df = pd.DataFrame(df)
In [5]:
# In this cell, read in the data
df.head(2290)
Out[5]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2384 https://www.airbnb.com/rooms/2384 20230912032549 2023-09-12 city scrape Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... You are invited stay in the guest room of my v... The apartment is less than one block from beau... https://a0.muscache.com/pictures/acf6b3c0-47f2... 2613 ... 4.99 4.96 4.94 R17000015609 f 1 0 1 0 2.15
1 7126 https://www.airbnb.com/rooms/7126 20230912032549 2023-09-12 city scrape Rental unit in Chicago · ★4.70 · 1 bedroom · 1... A very small studio in a wonderful neighborhoo... Ukrainian Village was just named "Hottest Neig... https://a0.muscache.com/pictures/51073/16c81c7... 17928 ... 4.87 4.89 4.75 R21000075737 f 1 1 0 0 2.92
2 10945 https://www.airbnb.com/rooms/10945 20230912032549 2023-09-12 city scrape Rental unit in Chicago · ★4.65 · 2 bedrooms · ... Beautiful first floor apartment in Historic Ol... NaN https://a0.muscache.com/pictures/58d1a420-a24b... 33004 ... 4.82 4.99 4.65 2209984 t 7 7 0 0 0.63
3 12140 https://www.airbnb.com/rooms/12140 20230912032549 2023-09-12 city scrape Boutique hotel in Chicago · ★4.93 · 1 bedroom ... Bed and Breakfast license issued by the City o... The Guest House is in the heart of Chicago's m... https://a0.muscache.com/pictures/miso/Hosting-... 46734 ... 5.00 5.00 4.87 R20000055258 f 1 0 1 0 0.15
4 24833 https://www.airbnb.com/rooms/24833 20230912032549 2023-09-12 city scrape Rental unit in Chicago · ★4.29 · 1 bedroom · 1... Perfect private entrance apartment one block f... Lincoln Park is a great neighborhood where peo... https://a0.muscache.com/pictures/ac17be6f-40bd... 101521 ... 4.03 4.90 4.20 City registration pending f 4 4 0 0 0.26
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2285 37091430 https://www.airbnb.com/rooms/37091430 20230912032549 2023-09-12 city scrape Condo in Chicago · ★4.83 · 2 bedrooms · 3 beds... Brand New Private Modern 2 bd 2bth apt for you... This neighborhood is alive! Pilsen has a vinta... https://a0.muscache.com/pictures/73e3f383-b4d2... 88547320 ... 4.95 4.72 4.83 R20000052495 f 10 10 0 0 5.51
2286 37107960 https://www.airbnb.com/rooms/37107960 20230912032549 2023-09-12 city scrape Rental unit in Chicago · 1 bedroom · 1 bed · 1... Show up and start living from day one in Chica... This furnished rental is in the historic Gold ... https://a0.muscache.com/pictures/prohost-api/H... 107434423 ... NaN NaN NaN 32+ Days Listing f 645 645 0 0 NaN
2287 37113529 https://www.airbnb.com/rooms/37113529 20230912032549 2023-09-12 city scrape Condo in Chicago · ★4.80 · 3 bedrooms · 5 beds... My Chicago themed home is a perfect retreat on... For a feel of the youthful, creative energy on... https://a0.muscache.com/pictures/prohost-api/H... 11981271 ... 4.95 4.91 4.84 R22000081935 t 3 3 0 0 1.70
2288 37125424 https://www.airbnb.com/rooms/37125424 20230912032549 2023-09-12 city scrape Condo in Chicago · ★4.95 · 1 bedroom · 1 bed ·... Our home is a lovely city terrace modern loft ... The area hosts the Finest Restaurants, and co... https://a0.muscache.com/pictures/pro_photo_too... 8540884 ... 5.00 5.00 4.66 R19000046100 f 1 1 0 0 0.85
2289 37143206 https://www.airbnb.com/rooms/37143206 20230912032549 2023-09-12 city scrape Home in Chicago · ★5.0 · 1 bedroom · 1 bed · 1... We are located by the I-290 and a 10 minute wa... Ronak’s place is located in Chicago, Illinois,... https://a0.muscache.com/pictures/e5c0bd38-ce61... 110725131 ... 5.00 3.25 5.00 R19000042272 f 5 0 5 0 0.08

2290 rows × 75 columns

2.2 Clean the data¶

We will restrict attention to only some variables. We'll drop variables that have too many missing values.

In [6]:
# In this cell restrict the dataframe to the columns identified above.
df = df[['id', 'name','host_id','host_since', 'host_location','host_acceptance_rate','host_is_superhost', 'host_neighbourhood', 'neighbourhood_cleansed','latitude','longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms','bedrooms', 'beds', 'price','minimum_nights', 'maximum_nights', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_communication', 'review_scores_location', 'review_scores_value']]
df= pd.DataFrame(df)
df.head(2890)
# If you decide to keep a variable not mentioned above, remember to tell me why you are doing it. I want to know your thought process!
Out[6]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... beds price minimum_nights maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value
0 2384 Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... 2613 2008-08-29 Chicago, IL 92% t Hyde Park Hyde Park 41.787900 ... 1.0 $114.00 3 60 227 4.99 4.98 4.99 4.96 4.94
1 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 1.0 $92.00 32 60 505 4.70 4.84 4.87 4.89 4.75
2 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 2.0 $170.00 4 1125 72 4.65 4.76 4.82 4.99 4.65
3 12140 Boutique hotel in Chicago · ★4.93 · 1 bedroom ... 46734 2009-10-18 Chicago, IL 100% f Lincoln Park Lincoln Park 41.923570 ... 1.0 $329.00 2 10 15 4.93 4.67 5.00 5.00 4.87
4 24833 Rental unit in Chicago · ★4.29 · 1 bedroom · 1... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.925961 ... 1.0 $61.00 32 395 42 4.29 4.07 4.03 4.90 4.20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2885 42900122 Rental unit in Chicago · ★4.91 · 3 bedrooms · ... 63239107 2016-03-16 Chicago, IL 100% t NaN Grand Boulevard 41.812620 ... 4.0 $172.00 1 21 56 4.91 4.95 5.00 4.71 4.84
2886 42910760 Rental unit in Chicago · ★4.25 · Studio · 2 be... 76244 2010-02-01 Chicago, IL 28% f Lincoln Park Lincoln Park 41.919310 ... 2.0 $200.00 5 1125 20 4.25 4.65 4.85 4.90 4.10
2887 42920816 Rental unit in Chicago · 1 bedroom · 1 bed · 1... 107434423 2016-12-16 New York, NY 97% f Cambridge Near West Side 41.883056 ... 1.0 $170.00 32 1125 2 5.00 5.00 5.00 5.00 5.00
2888 42929197 Home in Chicago · ★4.79 · 2 bedrooms · 2 beds ... 22998917 2014-10-26 Chicago, IL 90% f Humboldt Park Humboldt Park 41.908940 ... 2.0 $177.00 3 1125 62 4.79 4.89 4.92 4.73 4.77
2889 42933476 Home in Chicago · ★4.93 · 1 bedroom · 1 bed · ... 47056445 2015-10-20 Chicago, IL 85% t South Shore South Shore 41.762340 ... 1.0 $12.00 11 1125 14 4.93 4.93 5.00 4.64 4.86

2890 rows × 26 columns

2.3. Examining the data¶

How many listings are there? How many hosts? Are all variables in the correct format?

In [7]:
# How many listings are there?
counts = len(df)
print(f"Number of listings in df: {counts}")
Number of listings in df: 8528
In [8]:
# How many hosts?
unique_hosts = df['host_id'].unique()
pd.DataFrame(unique_hosts, columns=['individual_host'])
#We can see that there are a total of 3834 different hosts. This means that there are a lot of hosts that own multiple units.
Out[8]:
individual_host
0 2613
1 17928
2 33004
3 46734
4 101521
... ...
3829 395141659
3830 415825853
3831 468284714
3832 97961430
3833 536514763

3834 rows × 1 columns

In [9]:
df.dtypes
Out[9]:
id                               int64
name                            object
host_id                          int64
host_since                      object
host_location                   object
host_acceptance_rate            object
host_is_superhost               object
host_neighbourhood              object
neighbourhood_cleansed          object
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
price                           object
minimum_nights                   int64
maximum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
dtype: object
In [10]:
price_cor_id = 14167586
new_price = '75'
#for some reason this one listing has the wrong price, they had it listed as 7580 instead of just 75, i went to 
#the posting in the csv file
df.loc[df['id'] == price_cor_id, 'price'] = new_price
In [11]:
search_id = 14167586
result = df[df['id'] == search_id]
result
Out[11]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... beds price minimum_nights maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value
673 14167586 Rental unit in Chicago · ★4.80 · 1 bedroom · 1... 27232871 2015-02-04 Chicago, IL 100% f Ukrainian Village West Town 41.89863 ... 1.0 75 32 1125 213 4.8 4.85 4.86 4.92 4.86

1 rows × 26 columns

In [12]:
# Create a new variable called `df['price2']` that is equal to `price` but of the right type and with the right format.
df['price2'] = df['price']
df['price2'] = df['price2'].str.replace('$', '')
df['price2'] = df['price2'].str.replace(',', '')
df['price2'] = df['price2'].astype('float')
#you have to transfer as a float before an integer
df['price2'] = df['price2'].astype('Int64') 
df.head(2290)
C:\Users\Hunter\AppData\Local\Temp\ipykernel_2176\934704436.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  df['price2'] = df['price2'].str.replace('$', '')
Out[12]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... price minimum_nights maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2
0 2384 Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... 2613 2008-08-29 Chicago, IL 92% t Hyde Park Hyde Park 41.787900 ... $114.00 3 60 227 4.99 4.98 4.99 4.96 4.94 114
1 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... $92.00 32 60 505 4.70 4.84 4.87 4.89 4.75 92
2 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... $170.00 4 1125 72 4.65 4.76 4.82 4.99 4.65 170
3 12140 Boutique hotel in Chicago · ★4.93 · 1 bedroom ... 46734 2009-10-18 Chicago, IL 100% f Lincoln Park Lincoln Park 41.923570 ... $329.00 2 10 15 4.93 4.67 5.00 5.00 4.87 329
4 24833 Rental unit in Chicago · ★4.29 · 1 bedroom · 1... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.925961 ... $61.00 32 395 42 4.29 4.07 4.03 4.90 4.20 61
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2285 37091430 Condo in Chicago · ★4.83 · 2 bedrooms · 3 beds... 88547320 2016-08-07 NaN 97% t Lower West Side Lower West Side 41.856750 ... $99.00 1 1000 263 4.83 4.90 4.95 4.72 4.83 99
2286 37107960 Rental unit in Chicago · 1 bedroom · 1 bed · 1... 107434423 2016-12-16 New York, NY 97% f Cambridge Near North Side 41.898904 ... $126.00 32 1125 0 NaN NaN NaN NaN NaN 126
2287 37113529 Condo in Chicago · ★4.80 · 3 bedrooms · 5 beds... 11981271 2014-02-05 Chicago, IL 87% t West Town/Noble Square West Town 41.891286 ... $132.00 1 365 85 4.80 4.75 4.95 4.91 4.84 132
2288 37125424 Condo in Chicago · ★4.95 · 1 bedroom · 1 bed ·... 8540884 2013-08-30 Chicago, IL 100% t West Loop/Greektown Near West Side 41.886380 ... $253.00 10 365 41 4.95 4.95 5.00 5.00 4.66 253
2289 37143206 Home in Chicago · ★5.0 · 1 bedroom · 1 bed · 1... 110725131 2017-01-09 Houston, TX 43% f Garfield Park West Garfield Park 41.877450 ... $42.00 30 1125 4 5.00 4.75 5.00 3.25 5.00 42

2290 rows × 27 columns

In [13]:
df['host_acceptance_rate2'] = df['host_acceptance_rate'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('Int64')  
df.dtypes
Out[13]:
id                               int64
name                            object
host_id                          int64
host_since                      object
host_location                   object
host_acceptance_rate            object
host_is_superhost               object
host_neighbourhood              object
neighbourhood_cleansed          object
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
price                           object
minimum_nights                   int64
maximum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
price2                           Int64
host_acceptance_rate2            Int64
dtype: object
In [14]:
# Create a new variable called `df['host_acceptance_rate2']` that is equal to `host_acceptance_rate` but of the right type and with the right format.
df['host_acceptance_rate2'] = df['host_acceptance_rate']
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('float')
In [15]:
# Summary statistics of _relevant_ variables. Remember to first check that each relevant variable has the right format/is of the right type. Modify the variables if needed.
#i modified the other variables above so I can work on them here
# Create a new variable called `df['price2']` that is equal to `price` but of the right type and with the right format.
df['price2'] = df['price']
df['price2'] = df['price2'].str.replace('$', '')
df['price2'] = df['price2'].str.replace(',', '')
df['price2'] = df['price2'].astype('float')
#you have to transfer as a float before an integer
df['price2'] = df['price2'].astype('Int64') 

#host acceptance rate
df['host_acceptance_rate2'] = df['host_acceptance_rate'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('Int64')  

# Create a new variable called `df['host_is_superhost2']` that is equal to `host_is_superhost` but of the right type and with the right format.
df['host_is_superhost2'] = df['host_is_superhost'].replace({'f': 0, 't': 1}).astype(bool)



rel_var = df[['price2','host_acceptance_rate2', 'host_is_superhost2','bathrooms','beds', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy']]
rel_var.describe()
C:\Users\Hunter\AppData\Local\Temp\ipykernel_2176\1227629468.py:5: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  df['price2'] = df['price2'].str.replace('$', '')
Out[15]:
price2 host_acceptance_rate2 bathrooms beds number_of_reviews review_scores_rating review_scores_accuracy
count 8528.0 7779.0 0.0 8482.000000 8528.000000 6864.000000 6852.000000
mean 198.283185 91.912585 NaN 2.245461 46.595333 4.739273 4.789812
std 255.263629 17.881487 NaN 1.705293 87.779889 0.454732 0.388242
min 12.0 0.0 NaN 1.000000 0.000000 0.000000 1.000000
25% 87.0 94.0 NaN 1.000000 1.000000 4.687500 4.750000
50% 135.0 98.0 NaN 2.000000 14.000000 4.860000 4.900000
75% 214.0 100.0 NaN 3.000000 54.000000 4.980000 5.000000
max 5000.0 100.0 NaN 24.000000 3332.000000 5.000000 5.000000
In [16]:
df.dtypes
Out[16]:
id                               int64
name                            object
host_id                          int64
host_since                      object
host_location                   object
host_acceptance_rate            object
host_is_superhost               object
host_neighbourhood              object
neighbourhood_cleansed          object
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
price                           object
minimum_nights                   int64
maximum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
price2                           Int64
host_acceptance_rate2            Int64
host_is_superhost2                bool
dtype: object

There are some columns we may want to transform to numeric format. For example, price, host_acceptance_rate, and host_is_superhost appear to be in a wrong format. Price, for example, is stored as strings (at least is not a mixed type variable!). We need to clean it and make it a numeric variable.

Something similar happens with host_acceptance_rate. Transform this variable into percentages (0.75, for example, instead of 75%).

In [17]:
# Create a new variable called `df['host_acceptance_rate2']` that is equal to `host_acceptance_rate` but of the right type and with the right format.
df['host_acceptance_rate2'] = df['host_acceptance_rate']
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('float')

host_is_superhost seems to be boolean but it is not. It is supposed to be boolean, with f for 0 and t for 1. Transform it to numeric, with 1 instead of t and 0 instead of f.

In [18]:
# Create a new variable called `df['host_is_superhost2']` that is equal to `host_is_superhost` but of the right type and with the right format.
df['host_is_superhost2'] = df['host_is_superhost'].replace({'f': 0, 't': 1}).astype(bool)
In [19]:
missing_values = df['bathrooms'].isna()

#every single row is missing baths, I should just drop it below
rows_with_missing_values = df[missing_values]
rows_with_missing_values
Out[19]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2 host_acceptance_rate2 host_is_superhost2
0 2384 Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... 2613 2008-08-29 Chicago, IL 92% t Hyde Park Hyde Park 41.787900 ... 60 227 4.99 4.98 4.99 4.96 4.94 114 92.0 True
1 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 60 505 4.70 4.84 4.87 4.89 4.75 92 95.0 True
2 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 1125 72 4.65 4.76 4.82 4.99 4.65 170 95.0 True
3 12140 Boutique hotel in Chicago · ★4.93 · 1 bedroom ... 46734 2009-10-18 Chicago, IL 100% f Lincoln Park Lincoln Park 41.923570 ... 10 15 4.93 4.67 5.00 5.00 4.87 329 100.0 False
4 24833 Rental unit in Chicago · ★4.29 · 1 bedroom · 1... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.925961 ... 395 42 4.29 4.07 4.03 4.90 4.20 61 90.0 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8523 976998614768513719 Rental unit in Chicago · ★New · 5 bedrooms · 8... 11120895 2014-01-08 Chicago, IL 98% t Tri-Taylor Near West Side 41.875333 ... 365 0 NaN NaN NaN NaN NaN 438 98.0 True
8524 977144971555557664 Home in Chicago · ★New · 1 bedroom · 1 bed · 2... 260588213 2019-05-08 Chicago, IL 100% f McKinley Park Mckinley Park 41.825997 ... 365 0 NaN NaN NaN NaN NaN 32 100.0 False
8525 977746614717655088 Home in Chicago · ★New · 3 bedrooms · 3 beds ·... 536514763 2023-09-10 NaN NaN f West Englewood West Englewood 41.763133 ... 365 0 NaN NaN NaN NaN NaN 24 NaN False
8526 977980712187801208 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 107434423 2016-12-16 New York, NY 97% f Cambridge Loop 41.869479 ... 1125 0 NaN NaN NaN NaN NaN 127 97.0 False
8527 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... 365 0 NaN NaN NaN NaN NaN 366 99.0 False

8528 rows × 29 columns

In [20]:
df.drop('bathrooms', axis=1, inplace=True)
In [21]:
rel_var = df[['price2','host_acceptance_rate2', 'host_is_superhost2','beds', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy']]
rel_var.describe()
Out[21]:
price2 host_acceptance_rate2 beds number_of_reviews review_scores_rating review_scores_accuracy
count 8528.0 7779.000000 8482.000000 8528.000000 6864.000000 6852.000000
mean 198.283185 91.912585 2.245461 46.595333 4.739273 4.789812
std 255.263629 17.881487 1.705293 87.779889 0.454732 0.388242
min 12.0 0.000000 1.000000 0.000000 0.000000 1.000000
25% 87.0 94.000000 1.000000 1.000000 4.687500 4.750000
50% 135.0 98.000000 2.000000 14.000000 4.860000 4.900000
75% 214.0 100.000000 3.000000 54.000000 4.980000 5.000000
max 5000.0 100.000000 24.000000 3332.000000 5.000000 5.000000
While doing the first walkthrough of my data, I was perplexed that there was a couple of places that were actually renting as low as 12 dollars a night. This signifys that there are actually a good amount of places where a person can get a good deal for so low. The highest priced places were either large mansion sized houses or high-rises which had a large amount of beds, types of places that I did not even know could be rented out this way through AirBnB.

In the data for the city that I chose, the variable bathrooms is odd. Is this the case for the city you chose? Look at the data. What should you do with this variable? Do it. After doing that, turn to compute the summary statistics of the data and describe them!

Make sure you provide an interesting discussion of your summary statistics. If you only provide a table without any commentary/discussion, this exercise is useless. Remember, you are analyzing the data and you must explain it to a third person who has never seen these data (e.g., your boss) and you want to help this person to learn from the data. If you only provide a list of facts, then you are not doing your job as the analyst.

2.4. Create a correlation matrix for selected variables¶

Specifically, create a correlation matrix plot for review_scores_accuracy, review_scores_communication, review_scores_location,review_scores_value,minimum_nights, maximum_nights

In [22]:
# Correlation matrix
matrix = df[['review_scores_accuracy','review_scores_communication','review_scores_location','review_scores_value','minimum_nights','maximum_nights']]
matrix.corr()
Out[22]:
review_scores_accuracy review_scores_communication review_scores_location review_scores_value minimum_nights maximum_nights
review_scores_accuracy 1.000000 0.717416 0.513477 0.789684 -0.022795 -0.015459
review_scores_communication 0.717416 1.000000 0.405164 0.705978 -0.029796 -0.044169
review_scores_location 0.513477 0.405164 1.000000 0.525811 -0.055962 0.006788
review_scores_value 0.789684 0.705978 0.525811 1.000000 -0.037684 -0.027540
minimum_nights -0.022795 -0.029796 -0.055962 -0.037684 1.000000 0.131693
maximum_nights -0.015459 -0.044169 0.006788 -0.027540 0.131693 1.000000

2.5 Let's look at the data¶

Look at prices. Do you think the data looks reasonable? What types of properties and rooms are in the data?

Create a new dataframe called, for example, df2 that restricts attention to room_type == Entire home/apt. Once you have that, create another dataframe, called df3 that takes df2 and restricts attention to the following property_types: Entire residential home, Entire rental unit, Entire condominium (condo), Entire guesthouse, Entire serviced apartment, Entire townhouse, and Entire bungalow.

In [23]:
# Let's look at price
price_analysis = df[['price2']]
price_analysis.describe()
Out[23]:
price2
count 8528.0
mean 198.283185
std 255.263629
min 12.0
25% 87.0
50% 135.0
75% 214.0
max 5000.0

There are a couple of things that call our attention here. First, what is the minimum price? The 25th percentile? The 99th percentile? The maximum price? Does this suggest that we need to do further cleaning of the data? Why?

To do this, note that we have two variables called room_type and property_type. Let's restrict first to room_type == Entire home/apt.

In [24]:
# Create `df2` as defined above
df2 = pd.DataFrame()
df2 = df[df['room_type'] == 'Entire home/apt']

Let's restrict the analysis, from now on, to properties of the types: Entire residential home, Entire rental unit, Entire condominium (condo), Entire guesthouse, Entire serviced apartment, Entire townhouse, and Entire bungalow.

In [25]:
# Create `df3` as defined above
prop_types = [
    'Entire residential home',
    'Entire rental unit',
    'Entire condominium (condo)',
    'Entire guesthouse',
    'Entire serviced apartment',
    'Entire townhouse',
    'Entire bungalow'
]

df3 = df2[df2['property_type'].isin(prop_types)]

2.6. Let's filter the data on price now¶

Specifically, create df4 that takes df3 and keeps observations that have prices that are above the 5th and below the 95th percentile of the distribution of price in df3. Plot the distribution of prices (remember that price is a continuous variable!). How many listings are left? How many hosts?

In [26]:
# Create `df4` as defined above
price_5th_percentile = df3['price2'].quantile(0.05)
price_95th_percentile = df3['price2'].quantile(0.95)

# Create df4 by filtering df3 based on the specified price range
df4 = df3[(df3['price2'] > price_5th_percentile) & (df3['price2'] < price_95th_percentile)]
In [27]:
# Plot the distribution
plt.figure(figsize=(10, 6))
plt.hist(df4['price2'], bins=30, color='lightgreen', edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
    
plt.show()

How many listings remain? Unique hosts?

In [28]:
# Calculate the number of listings and unique hosts.
listings = len(df4)
hosts = df4['host_id'].nunique()

print(f"Number of listings in df4: {listings}")
print(f"Number of unique hosts in df4: {hosts}")
Number of listings in df4: 4241
Number of unique hosts in df4: 1919

3. Plot the data¶

Now that we have a cleaned dataset (though we'll clean it more later...), use folium to plot the locations.

In [29]:
Chicago = folium.Map(location=[41.87466362949316, -87.62469034613902], zoom_start=10.5)

for index, row in df3.iterrows():
    folium.Marker([row['latitude'], row['longitude']], tooltip=row['name']).add_to(Chicago)

Chicago
Out[29]:
Make this Notebook Trusted to load map: File -> Trust Notebook

4. Let's learn about hosts¶

There are fewer hosts than listings. This means that some hosts manage more than one listing.

4.1 Let's create a dataframe of hosts¶

This dataframe, called df_hosts, should have, for each unique host, the total number of listings that she manages, the average price of the listings she manages, and its super host status. The idea is to check if hosts that manage more than one listing charge higher or lower prices on average and to examine how the average price varies with the number of listings managed by each host (and the same for super host status).

In [30]:
# Create `df_hosts`. A good way to do this is to use `groupby` focusing on the different variables of interest and the operations you want to do with each of them.
#Group by 'host_id' and calculate total number of listings, average price, and super host status
df_hosts = df4.groupby('host_id').agg({
    'id': 'count',               #Total number of listings
    'price2': 'mean',            #Average price
    'host_is_superhost': 'max'   #Superhost status
}).reset_index()

df_hosts.columns = ['host_id', 'total_listings', 'average_price', 'superhost_status']

df_hosts
Out[30]:
host_id total_listings average_price superhost_status
0 2153 10 101.9 f
1 7529 1 135.0 t
2 17928 1 92.0 t
3 31673 1 141.0 t
4 33004 4 155.5 t
... ... ... ... ...
1914 531889983 1 265.0 f
1915 532571501 1 261.0 f
1916 533116573 1 106.0 f
1917 533550346 2 148.5 f
1918 535521512 1 150.0 f

1919 rows × 4 columns

In [31]:
# Make a scatterplot (`sns.scatterplot`) that has the number of listings per host on the horizontal axis and the mean price charged by that host on the vertical axis.
plt.figure(figsize=(12, 8))
sns.scatterplot(x='total_listings', y='average_price', data=df_hosts)
plt.title('Number of Listings per Host vs Mean Price')
plt.xlabel('Number of Listings per Host')
plt.ylabel('Mean Price')
plt.show()

It is clear that there are few hosts with many listings. Let's identify these and drop them, and their listings, from the data. Before doing this, however, let's see what type of listings belong to these hosts.

4.2. Drop hosts that have more than 50 listings¶

Take df4 and create a new dataframe, df5 that drops hosts with more than 50 listings (and those listings as well).

In [32]:
# Identify the hosts that have more than 50 listings (use logical indexing to do this)
more_than50 = df4.groupby('host_id').agg({
    'id': 'count'  #Total number of listings
}).reset_index()
more_than50.columns = ['host_id', 'total_listings']
to_drop = more_than50[more_than50['total_listings'] > 50]
In [33]:
to_drop
Out[33]:
host_id total_listings
123 3965428 73
608 47172572 57
925 107434423 645
1581 395498979 61
In [34]:
top_hosts = df4[df4['host_id'].isin(to_drop['host_id'])]
top_hosts
Out[34]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2 host_acceptance_rate2 host_is_superhost2
59 1083987 Rental unit in Chicago · ★5.0 · 1 bedroom · 2 ... 3965428 2012-10-24 Chicago, IL 69% t South Loop/Printers Row Near South Side 41.853950 ... 1123 23 5.00 5.00 5.00 4.81 4.69 120 69.0 True
82 1732652 Rental unit in Chicago · ★4.67 · 1 bedroom · 1... 3965428 2012-10-24 Chicago, IL 69% t South Loop/Printers Row Near North Side 41.892690 ... 365 7 4.67 4.40 4.67 5.00 4.50 153 69.0 True
95 1944802 Rental unit in Chicago · ★4.69 · 2 bedrooms · ... 3965428 2012-10-24 Chicago, IL 69% t South Loop/Printers Row Near South Side 41.854870 ... 1125 14 4.69 5.00 5.00 4.46 4.62 200 69.0 True
112 2583929 Rental unit in Chicago · ★4.92 · 1 bedroom · 2... 3965428 2012-10-24 Chicago, IL 69% t South Loop/Printers Row Near South Side 41.852260 ... 1122 16 4.92 4.92 5.00 4.69 4.77 145 69.0 True
113 2661465 Rental unit in Chicago · ★4.71 · 1 bedroom · 1... 3965428 2012-10-24 Chicago, IL 69% t South Loop/Printers Row Loop 41.887650 ... 365 25 4.71 4.71 4.71 4.92 4.63 160 69.0 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8507 975295012005513617 Rental unit in Chicago · ★New · 2 bedrooms · 2... 107434423 2016-12-16 New York, NY 97% f Cambridge West Town 41.894776 ... 1125 0 NaN NaN NaN NaN NaN 102 97.0 False
8508 975295340703444048 Rental unit in Chicago · ★New · 2 bedrooms · 2... 107434423 2016-12-16 New York, NY 97% f Cambridge West Town 41.894776 ... 1125 0 NaN NaN NaN NaN NaN 121 97.0 False
8513 975802849898108433 Rental unit in Chicago · ★New · 2 bedrooms · 2... 107434423 2016-12-16 New York, NY 97% f Cambridge Lake View 41.940411 ... 1125 0 NaN NaN NaN NaN NaN 229 97.0 False
8514 975806375396218857 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 107434423 2016-12-16 New York, NY 97% f Cambridge Loop 41.869480 ... 1125 0 NaN NaN NaN NaN NaN 148 97.0 False
8526 977980712187801208 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 107434423 2016-12-16 New York, NY 97% f Cambridge Loop 41.869479 ... 1125 0 NaN NaN NaN NaN NaN 127 97.0 False

836 rows × 28 columns

In [35]:
df4
Out[35]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2 host_acceptance_rate2 host_is_superhost2
1 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 60 505 4.70 4.84 4.87 4.89 4.75 92 95.0 True
2 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 1125 72 4.65 4.76 4.82 4.99 4.65 170 95.0 True
5 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... 300 51 4.30 4.57 4.13 4.92 4.35 95 90.0 True
10 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... 30 17 5.00 5.00 5.00 5.00 5.00 105 92.0 True
11 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... 1125 72 4.78 4.81 4.61 4.89 4.73 204 99.0 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8517 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... 365 0 NaN NaN NaN NaN NaN 248 90.0 False
8521 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... 365 0 NaN NaN NaN NaN NaN 150 100.0 True
8522 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... 122 0 NaN NaN NaN NaN NaN 176 NaN False
8526 977980712187801208 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 107434423 2016-12-16 New York, NY 97% f Cambridge Loop 41.869479 ... 1125 0 NaN NaN NaN NaN NaN 127 97.0 False
8527 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... 365 0 NaN NaN NaN NaN NaN 366 99.0 False

4241 rows × 28 columns

In [36]:
df5 = df4[~df4['host_id'].isin(top_hosts['host_id'])]
df5
Out[36]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... maximum_nights number_of_reviews review_scores_rating review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2 host_acceptance_rate2 host_is_superhost2
1 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 60 505 4.70 4.84 4.87 4.89 4.75 92 95.0 True
2 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 1125 72 4.65 4.76 4.82 4.99 4.65 170 95.0 True
5 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... 300 51 4.30 4.57 4.13 4.92 4.35 95 90.0 True
10 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... 30 17 5.00 5.00 5.00 5.00 5.00 105 92.0 True
11 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... 1125 72 4.78 4.81 4.61 4.89 4.73 204 99.0 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8511 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... 365 0 NaN NaN NaN NaN NaN 265 96.0 True
8517 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... 365 0 NaN NaN NaN NaN NaN 248 90.0 False
8521 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... 365 0 NaN NaN NaN NaN NaN 150 100.0 True
8522 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... 122 0 NaN NaN NaN NaN NaN 176 NaN False
8527 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... 365 0 NaN NaN NaN NaN NaN 366 99.0 False

3405 rows × 28 columns

4.3. Merge the information in your dataframe of "small hosts" (mean price and number of listings managed by each host) with that in df5¶

In [37]:
# Merge the information as described above
small_hosts = df5.groupby('host_id').agg({
    'id': 'count',               #Total number of listings
    'price2': 'mean',            #Average price
    'host_is_superhost': 'max'   #Superhost status
}).reset_index()
small_hosts.columns = ['host_id', 'total_listings', 'average_price', 'superhost_status']
merged_df = pd.merge(df5, small_hosts, on='host_id', how='left')

merged_df
Out[37]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... review_scores_accuracy review_scores_communication review_scores_location review_scores_value price2 host_acceptance_rate2 host_is_superhost2 total_listings average_price superhost_status
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 4.84 4.87 4.89 4.75 92 95.0 True 1 92.0 t
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 4.76 4.82 4.99 4.65 170 95.0 True 4 155.5 t
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... 4.57 4.13 4.92 4.35 95 90.0 True 3 130.333333 t
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... 5.00 5.00 5.00 5.00 105 92.0 True 1 105.0 t
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... 4.81 4.61 4.89 4.73 204 99.0 False 20 244.5 f
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... NaN NaN NaN NaN 265 96.0 True 1 265.0 t
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... NaN NaN NaN NaN 248 90.0 False 16 291.6875 f
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... NaN NaN NaN NaN 150 100.0 True 1 150.0 t
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... NaN NaN NaN NaN 176 NaN False 1 176.0 f
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... NaN NaN NaN NaN 366 99.0 False 26 147.576923 f

3405 rows × 31 columns

5. Econometrics¶

In this section, we'll use statsmodels, patsy, and linearmodels as we did in the ML handout to examine how listing and host attributes are related to listing price. Note that some variables have many missing observations so while we won't drop them explicitly, linearmodels won't consider those observations in the analysis.

Define the dataframe panel by taking df5 and restricting to df5[['host_id','bedrooms', 'beds','number_of_reviews','review_scores_rating','price2','host_is_superhost2', 'n_listings']].

You will also need to define df5['time']=1 to serve as a time fixed effect (that we won't use) and set panel = panel.set_index(['host_id','time']) as the indexes to be used in estimation. Then proceed to add a constant to panel when defining the exogenous variables that go into the regression (we did this in the ML handout). Define the dependent variable to be the natural logarithm of the price.

5.1 Create five additional variables¶

Choose five locations in the city that you chose, that you consider would be of interest to a tourist. Make sure these locations are not near each other but rather distant from each other (this will result in generating clusters of related listings). Then, compute the linear distance (i.e., Haversine) from each listing to the five locations of interest and add this information to df5. Do not compute driving distances.

In [38]:
df5 = merged_df
In [39]:
df5['time']=1
In [40]:
panel = df5[['host_id','bedrooms', 'beds','number_of_reviews','review_scores_rating','price2','host_is_superhost2','total_listings']]
In [41]:
panel = df5.set_index(['host_id','time'])
In [42]:
from shapely.geometry import Polygon, Point, LineString
from pyproj import CRS
In [43]:
lat_point_list = [41.64098624147555, 41.612691128743016, 42.27230259745743, 42.18361979589204]
lon_point_list = [-88.03606370170382,-87.31309668383324,-87.71827343224777,-88.23539863535352]

polygon_geom =  Polygon(zip(lon_point_list, lat_point_list))
In [44]:
crs = CRS('epsg:4326')
polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])
polygon['name']='Bounding_box'
print(polygon.geometry)
0    POLYGON ((-88.03606 41.64099, -87.31310 41.612...
Name: geometry, dtype: geometry
In [45]:
polygon.crs
Out[45]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
In [46]:
tourists_dict = {
    'latitude': [41.88266066881203, 41.92163043325434, 42.14395570837075, 41.88221147788901, 41.79240081830397],
    'longitude': [-87.62334364343683, -87.6340144787165, -87.79010853741309, -87.71624427481888, -87.58362020107282]
}

tourist_gdf = gpd.GeoDataFrame(
    {'Location': ['the_bean', 'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory', 'science_museum'],
     'geometry': [Point(lon, lat) for lat, lon in zip(tourists_dict['latitude'], tourists_dict['longitude'])]},
    crs="EPSG:4326"
)
In [47]:
tourist_gdf = gpd.GeoDataFrame(
    {'Location': ['the_bean', 'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory', 'science_museum'],
     'geometry': [Point(lat, lon) for lat, lon in zip(tourists_dict['latitude'], tourists_dict['longitude'])]},
    crs="EPSG:4326"
)

tourist_gdf = tourist_gdf.set_index('Location')

df5['the_bean'] = tourist_gdf.loc['the_bean', 'geometry']
df5['lincoln_park_zoo'] = tourist_gdf.loc['lincoln_park_zoo', 'geometry']
df5['botanic_garden'] = tourist_gdf.loc['botanic_garden', 'geometry']
df5['garfield_conservatory'] = tourist_gdf.loc['garfield_conservatory', 'geometry']
df5['science_museum'] = tourist_gdf.loc['science_museum', 'geometry']
In [48]:
geometry_points = [Point(lat, lon) for lat, lon in zip(df5['latitude'], df5['longitude'])]
df5['point'] = geometry_points

df5 = gpd.GeoDataFrame(df5, geometry='point', crs="EPSG:4326")
In [49]:
df5
Out[49]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... total_listings average_price superhost_status time the_bean lincoln_park_zoo botanic_garden garfield_conservatory science_museum point
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... 1 92.0 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021)
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... 4 155.5 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981)
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... 3 130.333333 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573)
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... 1 105.0 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512)
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... 20 244.5 f 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... 1 265.0 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668)
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... 16 291.6875 f 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820)
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... 1 150.0 t 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408)
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... 1 176.0 f 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413)
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... 26 147.576923 f 1 POINT (41.88266066881203 -87.62334364343683) POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723)

3405 rows × 38 columns

In [50]:
# Write down the function that you will use to loop over all listings and compute the distance to the points of interest. 
# You need to pass a dataframe as input (for example, `df5`) and get back a dataframe with all the information of the 
# original dataframe plus two columns with distance to the locations of interest. Call the output dataframe `df6`
from math import radians, sin, cos, atan2, sqrt
df6 = df5
def haversine_distance(coord1, coord2):
    # Haversine formula
    lat1, lon1 = coord1.y, coord1.x
    lat2, lon2 = coord2.y, coord2.x

    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    radius_of_earth = 6371000  # Radius of Earth in meters
    distance = radius_of_earth * c

    return distance
df6['distance_to_the_bean'] = df5.apply(lambda row: haversine_distance(row['point'], row['the_bean']), axis=1)
df6['distance_to_lincoln_park_zoo'] = df5.apply(lambda row: haversine_distance(row['point'], row['lincoln_park_zoo']), axis=1)
df6['distance_to_botanic_garden'] = df5.apply(lambda row: haversine_distance(row['point'], row['botanic_garden']), axis=1)
df6['distance_to_garfield_conservatory'] = df5.apply(lambda row: haversine_distance(row['point'], row['garfield_conservatory']), axis=1)
df6['distance_to_science_museum'] = df5.apply(lambda row: haversine_distance(row['point'], row['science_museum']), axis=1)
In [51]:
df6
Out[51]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... lincoln_park_zoo botanic_garden garfield_conservatory science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668) 381.965796 858.650326 18215.047133 9959.287607 4799.461418
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820) 1652.200318 484.782145 16928.675828 8678.524792 6086.181005
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408) 1195.067467 222.496468 17390.673562 9135.896752 5623.819314
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413) 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723) 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882

3405 rows × 43 columns

5.2 Show, on a folium map the location of the listings in df5 as well as of the locations of interest. Use one color for all the listings and a different color for the two locations of interest.¶

In [52]:
from folium.plugins import MarkerCluster

df6 = pd.DataFrame(df5) 

map_center = [df5['latitude'].mean(), df5['longitude'].mean()]
mymap = folium.Map(location=map_center, zoom_start=12)

marker_cluster = MarkerCluster().add_to(mymap)
for index, row in df5.iterrows():
    folium.Marker([row['latitude'], row['longitude']], popup=index, icon=folium.Icon(color='blue')).add_to(marker_cluster)

locations_of_interest = {
    'the_bean': {'latitude': 41.88266066881203, 'longitude': -87.62334364343683},
    'lincoln_park_zoo': {'latitude': 41.92163043325434, 'longitude': -87.6340144787165},
    'botanic_garden': {'latitude': 42.14395570837075, 'longitude': -87.79010853741309},
    'garfield_conservatory': {'latitude': 41.88221147788901, 'longitude': -87.71624427481888},
    'science_museum': {'latitude': 41.79240081830397, 'longitude': -87.58362020107282},
}
for location, coords in locations_of_interest.items():
    folium.Marker([coords['latitude'], coords['longitude']], popup=location, icon=folium.Icon(color='red')).add_to(mymap)

mymap
Out[52]:
Make this Notebook Trusted to load map: File -> Trust Notebook

5.3 Analysis¶

In [53]:
import statsmodels.api as sm
from patsy import dmatrices
from linearmodels.panel import PanelOLS
In [54]:
df5
Out[54]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... lincoln_park_zoo botanic_garden garfield_conservatory science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668) 381.965796 858.650326 18215.047133 9959.287607 4799.461418
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820) 1652.200318 484.782145 16928.675828 8678.524792 6086.181005
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408) 1195.067467 222.496468 17390.673562 9135.896752 5623.819314
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413) 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723) 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882

3405 rows × 43 columns

In [55]:
# Do all the data manipulation as described above and in the ML module
#My dependent variable is price2

panel = df5[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings','time']]
panel = panel.set_index(['host_id', 'time'])

panel['constant'] = 1
panel = panel.dropna()
dependent = panel.price2
exog = sm.add_constant(panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings']])
In [56]:
panel.dtypes
Out[56]:
bedrooms                float64
beds                    float64
number_of_reviews         int64
review_scores_rating    float64
price2                    Int64
host_is_superhost2         bool
total_listings            int64
constant                  int64
dtype: object
In [57]:
panel
Out[57]:
bedrooms beds number_of_reviews review_scores_rating price2 host_is_superhost2 total_listings constant
host_id time
17928 1 1.0 1.0 505 4.70 92 True 1 1
33004 1 2.0 2.0 72 4.65 170 True 4 1
101521 1 2.0 3.0 51 4.30 95 True 3 1
504470 1 1.0 1.0 17 5.00 105 True 1 1
683529 1 3.0 3.0 72 4.78 204 False 20 1
... ... ... ... ... ... ... ... ... ...
518917937 1 1.0 2.0 1 1.00 136 False 26 1
243035563 1 2.0 3.0 1 5.00 248 False 30 1
66596522 1 1.0 3.0 1 5.00 230 False 2 1
235678953 1 2.0 3.0 1 3.00 216 True 16 1
67751745 1 2.0 2.0 1 5.00 367 False 7 1

2719 rows × 8 columns

5.4. Estimate the linear model without fixed effects and discuss your findings. In doing so, first estimate the linear model without adding the distances to the touristic attractions and then estimate the model again with the touristic attractions. Did anything happen to your estimated parameters when you included the distance to touristic attractions? What happened with the price coefficient?¶

In [58]:
df5
Out[58]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... lincoln_park_zoo botanic_garden garfield_conservatory science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668) 381.965796 858.650326 18215.047133 9959.287607 4799.461418
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820) 1652.200318 484.782145 16928.675828 8678.524792 6086.181005
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408) 1195.067467 222.496468 17390.673562 9135.896752 5623.819314
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413) 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... POINT (41.92163043325434 -87.6340144787165) POINT (42.14395570837075 -87.79010853741309) POINT (41.88221147788901 -87.71624427481888) POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723) 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882

3405 rows × 43 columns

In [63]:
# Estimate the model without fixed effects as we did in the ML module. First do this without the distance variables and then include these and re-estimate. What happened with the estimated coefficients? Did anything happen with the price coefficient?

# Model without distance variables
dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'host_is_superhost2', 'total_listings']]

modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
---------------------------------------------------------------------------
AbsorbingEffectError                      Traceback (most recent call last)
Cell In[63], line 8
      5 independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'host_is_superhost2', 'total_listings']]
      7 modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False)
----> 8 resFE = modFE.fit(cov_type='unadjusted')
      9 print(resFE)

File C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1856, in PanelOLS.fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config)
   1854 if self.entity_effects or self.time_effects or self.other_effects:
   1855     if not self._drop_absorbed:
-> 1856         check_absorbed(x, [str(var) for var in self.exog.vars])
   1857     else:
   1858         # TODO: Need to special case the constant here when determining which
   1859         #  to retain since we always want to retain the constant if present
   1860         retain = not_absorbed(x, self._constant, self._constant_index)

File C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\utility.py:435, in check_absorbed(x, variables, x_orig)
    433     absorbed_variables = "\n".join(rows)
    434     msg = absorbing_error_msg.format(absorbed_variables=absorbed_variables)
--> 435     raise AbsorbingEffectError(msg)
    436 if x_orig is None:
    437     return

AbsorbingEffectError: 
The model cannot be estimated. The included effects have fully absorbed
one or more of the variables. This occurs when one or more of the dependent
variable is perfectly explained using the effects included in the model.

The following variables or variable combinations have been fully absorbed
or have become perfectly collinear after effects are removed:

          host_is_superhost2
          total_listings

Set drop_absorbed=True to automatically drop absorbed variables.

That means the two variables are perfectly explained by our dependent variable of price, I'll have to drop them to continue. The beds and and bedrooms will obviously have a positive correlation, it looks like that +1 bed creates an expected value of the price increasing by 12 dollars and +1 bedrooms having an increase of 18 dollars. The p values are suggesting that the data we are finding is also statistically significant as well

In [64]:
dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating']]

modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.8431
Estimator:                   PanelOLS   R-squared (Between):              0.8636
No. Observations:                2719   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.8431
Time:                        05:36:23   Log-likelihood                -1.555e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      3648.3
Entities:                        1669   P-value                           0.0000
Avg Obs:                       1.6291   Distribution:                  F(4,2715)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             3648.3
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(4,2715)
Avg Obs:                       2719.0                                           
Min Obs:                       2719.0                                           
Max Obs:                       2719.0                                           
                                                                                
                                  Parameter Estimates                                   
========================================================================================
                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
bedrooms                 18.321     2.5293     7.2435     0.0000      13.362      23.281
beds                     12.985     1.7139     7.5761     0.0000      9.6241      16.346
number_of_reviews       -0.0689     0.0164    -4.2042     0.0000     -0.1010     -0.0368
review_scores_rating     21.586     0.7593     28.430     0.0000      20.098      23.075
========================================================================================


In [65]:
panel_withdistance = df5[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings','time', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]
panel_withdistance = panel_withdistance.set_index(['host_id', 'time'])
In [66]:
dependent_variable = panel_withdistance['price2']
independent_variables = panel_withdistance[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]

modFE = PanelOLS(dependent_variable, independent_variables,time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.8635
Estimator:                   PanelOLS   R-squared (Between):              0.8727
No. Observations:                2719   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.8635
Time:                        05:36:24   Log-likelihood                -1.536e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      1904.7
Entities:                        1669   P-value                           0.0000
Avg Obs:                       1.6291   Distribution:                  F(9,2710)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             1904.7
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(9,2710)
Avg Obs:                       2719.0                                           
Min Obs:                       2719.0                                           
Max Obs:                       2719.0                                           
                                                                                
                                         Parameter Estimates                                         
=====================================================================================================
                                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------------------------
bedrooms                              23.945     2.3964     9.9924     0.0000      19.247      28.644
beds                                  10.726     1.6044     6.6855     0.0000      7.5804      13.872
number_of_reviews                    -0.0534     0.0154    -3.4604     0.0005     -0.0837     -0.0232
review_scores_rating                  10.164     3.3717     3.0145     0.0026      3.5526      16.775
distance_to_the_bean                 -0.0211     0.0032    -6.5499     0.0000     -0.0274     -0.0148
distance_to_lincoln_park_zoo          0.0052     0.0026     2.0068     0.0449      0.0001      0.0102
distance_to_botanic_garden           -0.0004     0.0009    -0.4482     0.6540     -0.0021      0.0013
distance_to_garfield_conservatory     0.0045     0.0007     6.2075     0.0000      0.0031      0.0059
distance_to_science_museum            0.0124     0.0014     9.2035     0.0000      0.0098      0.0151
=====================================================================================================


C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1214: MissingValueWarning: 
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)

5.5. Add entity_effects that correspond to the host_id. Add drop_absorbed = True when defining the problem.¶

Let's now control for host FE to capture, for example, the ability of a host (as host with multiple listings may be more skilled). Note that some covariates will be dropped because the fixed effect will be collinear with any variable that doesn't change across listings (in particular, distance to touristic attractions and any variable that doesn't change within observations of the same listing). Discuss your findings.

In [67]:
panel
Out[67]:
bedrooms beds number_of_reviews review_scores_rating price2 host_is_superhost2 total_listings constant
host_id time
17928 1 1.0 1.0 505 4.70 92 True 1 1
33004 1 2.0 2.0 72 4.65 170 True 4 1
101521 1 2.0 3.0 51 4.30 95 True 3 1
504470 1 1.0 1.0 17 5.00 105 True 1 1
683529 1 3.0 3.0 72 4.78 204 False 20 1
... ... ... ... ... ... ... ... ... ...
518917937 1 1.0 2.0 1 1.00 136 False 26 1
243035563 1 2.0 3.0 1 5.00 248 False 30 1
66596522 1 1.0 3.0 1 5.00 230 False 2 1
235678953 1 2.0 3.0 1 3.00 216 True 16 1
67751745 1 2.0 2.0 1 5.00 367 False 7 1

2719 rows × 8 columns

In [68]:
# Estimate the model with fixed effects as we did in the ML module
from linearmodels.panel import PanelOLS

dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating']]

modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False, drop_absorbed=True)
resFE = modFE.fit()  
print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.3276
Estimator:                   PanelOLS   R-squared (Between):              0.5795
No. Observations:                2719   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.5504
Time:                        05:36:26   Log-likelihood                -1.249e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      127.41
Entities:                        1669   P-value                           0.0000
Avg Obs:                       1.6291   Distribution:                  F(4,1046)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             127.41
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(4,1046)
Avg Obs:                       2719.0                                           
Min Obs:                       2719.0                                           
Max Obs:                       2719.0                                           
                                                                                
                                  Parameter Estimates                                   
========================================================================================
                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
bedrooms                 23.193     2.8357     8.1789     0.0000      17.629      28.757
beds                     12.269     1.9634     6.2489     0.0000      8.4165      16.122
number_of_reviews       -0.0133     0.0214    -0.6196     0.5356     -0.0552      0.0287
review_scores_rating    -1.7912     3.7711    -0.4750     0.6349     -9.1910      5.6086
========================================================================================

F-test for Poolability: 5.3126
P-value: 0.0000
Distribution: F(1668,1046)

Included effects: Entity
In [69]:
panel_withdistance
Out[69]:
bedrooms beds number_of_reviews review_scores_rating price2 host_is_superhost2 total_listings distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum
host_id time
17928 1 1.0 1.0 505 4.70 92 True 1 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508
33004 1 2.0 2.0 72 4.65 170 True 4 1835.918505 645.955946 16744.020648 8500.159874 6272.529685
101521 1 2.0 3.0 51 4.30 95 True 3 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980
504470 1 1.0 1.0 17 5.00 105 True 1 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700
683529 1 3.0 3.0 72 4.78 204 False 20 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977
... ... ... ... ... ... ... ... ... ... ... ... ... ...
415825853 1 2.0 2.0 0 NaN 265 True 1 381.965796 858.650326 18215.047133 9959.287607 4799.461418
177188450 1 1.0 2.0 0 NaN 248 False 16 1652.200318 484.782145 16928.675828 8678.524792 6086.181005
441260321 1 1.0 1.0 0 NaN 150 True 1 1195.067467 222.496468 17390.673562 9135.896752 5623.819314
97961430 1 2.0 2.0 0 NaN 176 False 1 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130
518917937 1 3.0 4.0 0 NaN 366 False 26 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882

3405 rows × 12 columns

In [70]:
dependent_variable = panel_withdistance['price2']
independent_variables = panel_withdistance[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]

modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False, drop_absorbed=True)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.3431
Estimator:                   PanelOLS   R-squared (Between):              0.8418
No. Observations:                2719   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.8360
Time:                        05:36:27   Log-likelihood                -1.246e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      60.407
Entities:                        1669   P-value                           0.0000
Avg Obs:                       1.6291   Distribution:                  F(9,1041)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             60.407
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(9,1041)
Avg Obs:                       2719.0                                           
Min Obs:                       2719.0                                           
Max Obs:                       2719.0                                           
                                                                                
                                         Parameter Estimates                                         
=====================================================================================================
                                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------------------------
bedrooms                              23.337     2.8139     8.2936     0.0000      17.816      28.859
beds                                  12.488     1.9494     6.4062     0.0000      8.6631      16.313
number_of_reviews                    -0.0207     0.0213    -0.9727     0.3309     -0.0624      0.0210
review_scores_rating                 -0.4863     3.7657    -0.1291     0.8973     -7.8755      6.9029
distance_to_the_bean                 -0.0171     0.0058    -2.9544     0.0032     -0.0284     -0.0057
distance_to_lincoln_park_zoo          0.0068     0.0034     1.9855     0.0473   7.962e-05      0.0135
distance_to_botanic_garden            0.0031     0.0045     0.6962     0.4864     -0.0056      0.0118
distance_to_garfield_conservatory     0.0024     0.0012     1.9179     0.0554  -5.481e-05      0.0048
distance_to_science_museum            0.0131     0.0072     1.8296     0.0676     -0.0009      0.0271
=====================================================================================================

F-test for Poolability: 4.6278
P-value: 0.0000
Distribution: F(1668,1041)

Included effects: Entity
C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1214: MissingValueWarning: 
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)

6. Let's now turn to listings¶

In this section we will prepare the scene for the main part of the analysis: Who competes with whom? How do rival prices impact the price charged for a listing? How does this depend on the number of rivals that a listing faces?

To address these questions, we will first determine which listings are direct competitors of each other. Instead of computing distances, as we did in class, we'll use Hierarchical Clustering as it is much less intensive in terms of memory usage and the size of the files that are generated. By using HC, we'll be able to identify listings that are near each other and belong to the same market/cluster. With this, we'll be able to determine which listings compete against each other.

Once we have the set of listings that compete with each other, we will compute, for each listing, the average price of its rivals' listings.

With all of this, we will examine how prices depend not only on the listing's attribute and the super host status of who manages it (and the number of listings that host manages), but also with the average price of the rival listings and the number of rivals in the cluster.

Let's get started. First step, determine which listings compete against each other. Create a matrix X as follows (adapt this code to your setting):

In [71]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
In [72]:
X = np.zeros((df6.shape[0],2))
X[:,0] = df6['latitude']
X[:,1] = df6['longitude']

Use the linkage function with method = centroid. Once you have the output:

  • plot the dendrogram,
  • use a distance measure of 0.005 to trim the tree and define clusters.

After doing so, make sure to do the following:

  1. How many clusters are there?
  2. Make a histogram of the distribution of the number of listings per cluster.
  3. What is the mean number of listings per market? Median? Standard deviation? Minimum? Maximum?
  4. Make a scatterplot of the data, using sns.scatterplot that uses hue to color code the clusters.
In [73]:
# Define the problem and use `linkage`
Z = linkage(X, method='centroid')
In [74]:
# Plot the dendrogram
plt.figure(figsize=(25,10))
plt.title('Hierarchal Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
dendrogram(
    Z,
    leaf_rotation=90.,
    leaf_font_size=8.,
)
plt.show()
    
In [75]:
# Define the clusters using a distance measure of 0.005
distance_threshold = 0.005
clusters = fcluster(Z, t=distance_threshold, criterion='distance')
In [76]:
# Add the cluster ids to `df5`
df5['cluster_id'] = clusters

6.1. How many clusters are there?¶

In [77]:
num_clusters = max(clusters)
f"The number of clusters is: {num_clusters}."
Out[77]:
'The number of clusters is: 468.'

6.2. Histogram of the number of listings per market.¶

In [78]:
cluster_counts = df5['cluster_id'].value_counts()

plt.figure(figsize=(25,10))
plt.bar(cluster_counts.index, cluster_counts.values, color='red')
plt.xlabel('cluster id')
plt.ylabel('Number of Listings')
plt.title('Distribution of Number of Listings per Cluster')
plt.show()

6.3 Summary statistics of listings per cluster¶

In [79]:
cluster_counts.describe()
Out[79]:
count    468.000000
mean       7.275641
std        9.373460
min        1.000000
25%        1.000000
50%        4.000000
75%        9.000000
max       61.000000
Name: cluster_id, dtype: float64

6.4. Scatterplot of clusters¶

In [80]:
cluster_counts
Out[80]:
107    61
105    51
110    48
128    48
106    46
       ..
341     1
389     1
380     1
290     1
293     1
Name: cluster_id, Length: 468, dtype: int64
In [81]:
cluster_counts_df = pd.DataFrame({'cluster_id': cluster_counts.index, 'number_of_Listings': cluster_counts.values})

plt.figure(figsize=(25, 10))
sns.scatterplot(data=cluster_counts_df, x='cluster_id', y='number_of_Listings', hue='cluster_id')
plt.title('Scatterplot of Number of Listings per Cluster')
plt.xlabel('cluster ID')
plt.ylabel('Number of Listings')
plt.show()

7. Add the cluster ID to each listing and compute the mean price of the cluster and the number of listings in the cluster¶

Now that we have the cluster ID for each observation, we can compute the average price of the cluster together with the number of observations that are in a cluster. Create df7 by merging df6 with the dataframe that has the mean price per cluster and the number of listings per cluster.

In [82]:
df6['cluster_id']=df5['cluster_id']
In [83]:
cluster_stats = df6.groupby('cluster_id')['price2'].agg(['mean', 'count']).reset_index()
cluster_stats.columns = ['cluster_id', 'mean_price', 'number_of_listings']
df7 = pd.merge(df6, cluster_stats, on='cluster_id', how='left')
In [84]:
df7
Out[84]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508 194 142.05 20
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685 125 184.384615 26
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980 22 227.727273 11
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700 52 124.6 5
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977 7 181.833333 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668) 381.965796 858.650326 18215.047133 9959.287607 4799.461418 133 210.965517 29
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820) 1652.200318 484.782145 16928.675828 8678.524792 6086.181005 115 244.365854 41
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408) 1195.067467 222.496468 17390.673562 9135.896752 5623.819314 131 237.607143 28
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413) 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130 95 250.666667 36
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723) 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882 34 205.962963 27

3405 rows × 46 columns

8. Re-estimate the linear regressions with the cluster-level information¶

Before we re-run the regressions, note that some clusters only have one observations. For these clusters, we cannot include the average price of the rivals in the regressions. We need to exclude these.

In [85]:
# Define `df8` as the subset of df7 that has more than 1 listing.
df8 = df7[df7['number_of_listings'] > 1]
In [86]:
# Define a new `panel2` dataframe, add the time fixed effect, define the indexes, and re-estimate the linear regressions.

panel2 = df8[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'number_of_listings','time']]
panel2 = panel2.set_index(['host_id', 'time'])

panel2['constant'] = 1
panel2 = panel2.dropna()
dependent = panel2.price2
exog = sm.add_constant(panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'number_of_listings']])
In [87]:
dependent_variable = panel2['price2']
independent_variables = panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating','number_of_listings']]

modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.8602
Estimator:                   PanelOLS   R-squared (Between):              0.8734
No. Observations:                2613   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.8602
Time:                        05:37:05   Log-likelihood                -1.482e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      3209.4
Entities:                        1577   P-value                           0.0000
Avg Obs:                       1.6569   Distribution:                  F(5,2608)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             3209.4
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(5,2608)
Avg Obs:                       2613.0                                           
Min Obs:                       2613.0                                           
Max Obs:                       2613.0                                           
                                                                                
                                  Parameter Estimates                                   
========================================================================================
                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
bedrooms                 23.225     2.4844     9.3480     0.0000      18.353      28.096
beds                     11.954     1.6695     7.1607     0.0000      8.6808      15.228
number_of_reviews       -0.0727     0.0158    -4.6078     0.0000     -0.1037     -0.0418
review_scores_rating     13.549     0.8701     15.572     0.0000      11.843      15.255
number_of_listings       1.7910     0.1014     17.659     0.0000      1.5921      1.9898
========================================================================================


In [90]:
panel2
Out[90]:
bedrooms beds number_of_reviews review_scores_rating price2 host_is_superhost2 number_of_listings constant
host_id time
17928 1 1.0 1.0 505 4.70 92 True 20 1
33004 1 2.0 2.0 72 4.65 170 True 26 1
101521 1 2.0 3.0 51 4.30 95 True 11 1
504470 1 1.0 1.0 17 5.00 105 True 5 1
683529 1 3.0 3.0 72 4.78 204 False 30 1
... ... ... ... ... ... ... ... ... ...
518917937 1 1.0 2.0 1 1.00 136 False 37 1
243035563 1 2.0 3.0 1 5.00 248 False 48 1
66596522 1 1.0 3.0 1 5.00 230 False 36 1
235678953 1 2.0 3.0 1 3.00 216 True 36 1
67751745 1 2.0 2.0 1 5.00 367 False 9 1

2613 rows × 8 columns

In [91]:
dependent_variable = panel2['price2']
independent_variables = panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'number_of_listings']]

modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')

print(resFE)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 price2   R-squared:                        0.8602
Estimator:                   PanelOLS   R-squared (Between):              0.8734
No. Observations:                2613   R-squared (Within):               0.0000
Date:                Mon, Dec 04 2023   R-squared (Overall):              0.8602
Time:                        05:37:05   Log-likelihood                -1.482e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      3209.4
Entities:                        1577   P-value                           0.0000
Avg Obs:                       1.6569   Distribution:                  F(5,2608)
Min Obs:                       1.0000                                           
Max Obs:                       29.000   F-statistic (robust):             3209.4
                                        P-value                           0.0000
Time periods:                       1   Distribution:                  F(5,2608)
Avg Obs:                       2613.0                                           
Min Obs:                       2613.0                                           
Max Obs:                       2613.0                                           
                                                                                
                                  Parameter Estimates                                   
========================================================================================
                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
bedrooms                 23.225     2.4844     9.3480     0.0000      18.353      28.096
beds                     11.954     1.6695     7.1607     0.0000      8.6808      15.228
number_of_reviews       -0.0727     0.0158    -4.6078     0.0000     -0.1037     -0.0418
review_scores_rating     13.549     0.8701     15.572     0.0000      11.843      15.255
number_of_listings       1.7910     0.1014     17.659     0.0000      1.5921      1.9898
========================================================================================


9. Machine learning application¶

We'll now turn to the last part of the problem set. First, create a new variable that takes the value of one if the price of a listing is above the mean price of that of the market. Note that for clusters with a single listing this is not useful so we'll continue to work with the subset of the data that doesn't have clusters with a single listing. Note, also, that we'll need to clean and scale some variables before we can turn to predicting prices.

9.1. We need to clean the data a bit more¶

First, drop name, host_since, host_location, host_neighbourhood, neighbourhood_cleansed and host_acceptance_rate2 as we won't use them. Also drop host_id, accommodates, bedrooms and rows that have missing prices or missing review_scores_rating. Finally, of all the review* columns, keep review_scores_rating only and drop observations for which the host_is_superhost2 variable is missing..

In [92]:
df8
Out[92]:
id name host_id host_since host_location host_acceptance_rate host_is_superhost host_neighbourhood neighbourhood_cleansed latitude ... science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings
0 7126 Rental unit in Chicago · ★4.70 · 1 bedroom · 1... 17928 2009-05-19 Chicago, IL 95% t Ukrainian Village West Town 41.901660 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508 194 142.05 20
1 10945 Rental unit in Chicago · ★4.65 · 2 bedrooms · ... 33004 2009-08-21 Chicago, IL 95% t Old Town Lincoln Park 41.911960 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685 125 184.384615 26
2 25879 Rental unit in Chicago · ★4.30 · 2 bedrooms · ... 101521 2010-03-31 Chicago, IL 90% t Lincoln Park Lincoln Park 41.924990 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980 22 227.727273 11
3 94450 Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... 504470 2011-04-12 Chicago, IL 92% t Andersonville Edgewater 41.979600 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700 52 124.6 5
4 145659 Rental unit in Chicago · ★4.78 · 3 bedrooms · ... 683529 2011-06-09 Chicago, IL 99% f Roscoe Village North Center 41.943420 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977 7 181.833333 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3400 975407160632841105 Rental unit in Chicago · ★New · 2 bedrooms · 2... 415825853 2021-07-29 Calumet Park, IL 96% t Near South Side Near South Side 41.863088 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.86309 -87.62668) 381.965796 858.650326 18215.047133 9959.287607 4799.461418 133 210.965517 29
3401 976106807063239067 Serviced apartment in Chicago · ★New · 1 bedro... 177188450 2018-03-07 Chicago, IL 90% f River North Near North Side 41.891832 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.89183 -87.63820) 1652.200318 484.782145 16928.675828 8678.524792 6086.181005 115 244.365854 41
3402 976881098303922237 Rental unit in Chicago · ★New · 1 bedroom · 1 ... 441260321 2022-01-19 Chicago, IL 100% t Downtown Los Angeles Loop 41.873189 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.87319 -87.63408) 1195.067467 222.496468 17390.673562 9135.896752 5623.819314 131 237.607143 28
3403 976914878462533348 Rental unit in Chicago · ★New · 2 bedrooms · 2... 97961430 2016-10-04 Chicago, IL NaN f West Loop Near West Side 41.887230 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.88723 -87.64413) 2311.432585 1135.778420 16271.812047 8018.773304 6742.687130 95 250.666667 36
3404 978264935480956783 Rental unit in Chicago · ★New · 3 bedrooms · 4... 518917937 2023-06-08 Chicago, IL 99% f West Town Lake View 41.948470 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.94847 -87.65723) 3780.016837 2584.357797 14800.601024 6568.821704 8216.681882 34 205.962963 27

3277 rows × 46 columns

In [93]:
columns_to_drop = ['name', 'host_since', 'host_location', 'host_neighbourhood', 'neighbourhood_cleansed', 'host_acceptance_rate2']
df9 = df8.drop(columns=columns_to_drop, inplace=False)
In [94]:
df9.columns
Out[94]:
Index(['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
       'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
       'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
       'average_price', 'superhost_status', 'time', 'the_bean',
       'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
       'science_museum', 'point', 'distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum',
       'cluster_id', 'mean_price', 'number_of_listings'],
      dtype='object')
In [95]:
df9 = df9.dropna(subset=['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
       'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
       'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
       'average_price', 'superhost_status', 'time', 'the_bean',
       'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
       'science_museum', 'point', 'distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum',
       'cluster_id', 'mean_price', 'number_of_listings'])
In [96]:
df9
Out[96]:
id host_id host_acceptance_rate host_is_superhost latitude longitude property_type room_type accommodates bedrooms ... science_museum point distance_to_the_bean distance_to_lincoln_park_zoo distance_to_botanic_garden distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings
0 7126 17928 95% t 41.901660 -87.680210 Entire rental unit Entire home/apt 2 1.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.90166 -87.68021) 6323.842702 5137.509592 12266.427808 4007.769743 10752.015508 194 142.05 20
1 10945 33004 95% t 41.911960 -87.639810 Entire rental unit Entire home/apt 4 2.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.91196 -87.63981) 1835.918505 645.955946 16744.020648 8500.159874 6272.529685 125 184.384615 26
2 25879 101521 90% t 41.924990 -87.655730 Entire rental unit Entire home/apt 6 2.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.92499 -87.65573) 3606.412223 2414.704591 14973.467194 6731.620259 8041.584980 22 227.727273 11
3 94450 504470 92% t 41.979600 -87.665120 Entire rental unit Entire home/apt 1 1.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.97960 -87.66512) 4666.399409 3468.863401 13916.954743 5701.480087 9103.333700 52 124.6 5
4 145659 683529 99% f 41.943420 -87.681210 Entire rental unit Entire home/apt 8 3.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.94342 -87.68121) 6440.393583 5248.836639 12140.954107 3905.207423 10873.631977 7 181.833333 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3358 965968083633594916 518917937 99% f 41.955120 -87.651490 Entire rental unit Entire home/apt 3 1.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.95512 -87.65149) 3147.306075 1949.216084 15436.260635 7207.795509 7584.157672 30 138.405405 37
3365 967385536206513874 243035563 100% f 41.856172 -87.627915 Entire rental unit Entire home/apt 6 2.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.85617 -87.62792) 522.801616 741.922360 18080.299439 9822.420541 4934.306813 128 254.729167 48
3367 967598679107146992 66596522 100% f 41.891370 -87.628410 Entire rental unit Entire home/apt 5 1.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.89137 -87.62841) 564.779711 638.518824 18015.025569 9766.813293 5001.565470 113 233.972222 36
3378 969768212770420822 235678953 96% t 41.887883 -87.643916 Entire rental unit Entire home/apt 6 2.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.88788 -87.64392) 2287.656478 1111.788461 16295.362898 8042.588536 6719.130615 95 250.666667 36
3391 973015093684752148 67751745 99% f 41.896538 -87.638062 Entire rental unit Entire home/apt 6 2.0 ... POINT (41.79240081830397 -87.58362020107282) POINT (41.89654 -87.63806) 1637.883057 464.577391 16942.304100 8693.678335 6072.897973 112 243.555556 9

2426 rows × 40 columns

In [97]:
df9 = df9.dropna(subset=['host_id', 'accommodates', 'bedrooms', 'price2', 'review_scores_rating','host_is_superhost2'])

9.2. Create categorical variables¶

We now have to scale numeric variables to be between 0 and 1, and turn strings into exclusive dummy variables as we did in the ML handout. There will be one difference, however. Instead of creating boolean variables for the property_type, we'll create binary variables. This is, instead of the variable saying True or False, it should have a 1 or a 0. We'll do this because some methods do not handle boolean variables well. To do this, use np.where() as follows.

Assume you have a dataframe called dX that has a column called type that takes three possible values: a, b, and c. Then, you would create three binary indicators called, for example dX['da'], dX['db'], and dX['dc'] by doing dX['da'] = np.where(dX['type'] == 'a', 1, 0) where the order of the arguments says "if the condition dX['type'] == 'a' is true, then put a 1 in the new variable dX['da'] and a 0 otherwise. Adapt this to your case and use this to create df10. After you have done this, scale all the other numeric variables in df10 to be between 0 and 1.

Finally, add a column to df10 that will identify a price above the mean of the cluster.

Now we can turn to splitting the data.

9.3. Load relevant modules, split the data, and define the problems¶

In [98]:
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler
In [99]:
property_types = df9['property_type'].unique()
for prop_type in property_types:
    df9[f'is_{prop_type}'] = np.where(df9['property_type'] == prop_type, 1, 0)

df10 = df9.drop(columns=['property_type'])

numeric_columns = df10.select_dtypes(include=[np.number]).columns
scaler = MinMaxScaler()
df10[numeric_columns] = scaler.fit_transform(df10[numeric_columns])

df10
Out[99]:
id host_id host_acceptance_rate host_is_superhost latitude longitude room_type accommodates bedrooms beds ... distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings is_Entire rental unit is_Entire townhouse is_Entire guesthouse is_Entire serviced apartment is_Entire bungalow
0 0.000000e+00 0.000030 95% t 0.672639 0.559980 Entire home/apt 0.066667 0.000000 0.000000 ... 0.213154 0.366124 0.413276 0.271975 0.305085 1.0 0.0 0.0 0.0 0.0
1 3.924913e-15 0.000058 95% t 0.700683 0.695847 Entire home/apt 0.200000 0.166667 0.066667 ... 0.455644 0.212760 0.265525 0.451739 0.406780 1.0 0.0 0.0 0.0 0.0
2 1.927308e-14 0.000187 90% t 0.736161 0.642307 Entire home/apt 0.333333 0.166667 0.133333 ... 0.360182 0.273327 0.044968 0.635785 0.152542 1.0 0.0 0.0 0.0 0.0
3 8.974578e-14 0.000943 92% t 0.884853 0.610728 Entire home/apt 0.000000 0.000000 0.000000 ... 0.304577 0.309678 0.109208 0.197877 0.050847 1.0 0.0 0.0 0.0 0.0
4 1.423750e-13 0.001279 99% f 0.786342 0.556617 Entire home/apt 0.466667 0.333333 0.133333 ... 0.207618 0.370288 0.012848 0.440906 0.474576 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3358 9.927576e-01 0.974363 99% f 0.818199 0.656566 Entire home/apt 0.133333 0.000000 0.066667 ... 0.385885 0.257666 0.062099 0.256499 0.593220 1.0 0.0 0.0 0.0 0.0
3365 9.942143e-01 0.456341 100% f 0.548786 0.735848 Entire home/apt 0.333333 0.166667 0.133333 ... 0.527017 0.166943 0.271949 0.750442 0.779661 1.0 0.0 0.0 0.0 0.0
3367 9.944334e-01 0.125044 100% f 0.644621 0.734185 Entire home/apt 0.266667 0.000000 0.133333 ... 0.524015 0.169246 0.239829 0.662302 0.576271 1.0 0.0 0.0 0.0 0.0
3378 9.966631e-01 0.442528 96% t 0.635125 0.682038 Entire home/apt 0.333333 0.166667 0.133333 ... 0.430945 0.228050 0.201285 0.733192 0.576271 1.0 0.0 0.0 0.0 0.0
3391 1.000000e+00 0.127213 99% f 0.658694 0.701724 Entire home/apt 0.333333 0.166667 0.066667 ... 0.466090 0.205925 0.237687 0.702996 0.118644 1.0 0.0 0.0 0.0 0.0

2426 rows × 44 columns

In [100]:
property_types = df9['property_type'].unique()
for prop_type in property_types:
    df9[f'is_{prop_type}'] = np.where(df9['property_type'] == prop_type, 1, 0)

df10 = df9.drop(columns=['property_type'])
In [101]:
train_df, test_df = train_test_split(df10, test_size = 0.2, random_state = 1)

We'll estimate the default Ridge, Lasso, Decision Tree, and Random Forest regressors, and compute the relevant performance measure for each of them using 10-K fold cross-validation. Then we'll use a grid of parameters to find the best specification for each of them. Finally, we'll use a Decision Tree and a Random Forest classifier to predict whether the price of a listing will be above the mean of the cluster.

In [102]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

from sklearn.tree import export_graphviz
In [103]:
df.columns
Out[103]:
Index(['id', 'name', 'host_id', 'host_since', 'host_location',
       'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
       'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bedrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'number_of_reviews',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price2', 'host_acceptance_rate2',
       'host_is_superhost2'],
      dtype='object')
In [104]:
train_df.columns
Out[104]:
Index(['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
       'latitude', 'longitude', 'room_type', 'accommodates', 'bedrooms',
       'beds', 'price', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
       'average_price', 'superhost_status', 'time', 'the_bean',
       'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
       'science_museum', 'point', 'distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum',
       'cluster_id', 'mean_price', 'number_of_listings',
       'is_Entire rental unit', 'is_Entire townhouse', 'is_Entire guesthouse',
       'is_Entire serviced apartment', 'is_Entire bungalow'],
      dtype='object')
In [105]:
train_df
Out[105]:
id host_id host_acceptance_rate host_is_superhost latitude longitude room_type accommodates bedrooms beds ... distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings is_Entire rental unit is_Entire townhouse is_Entire guesthouse is_Entire serviced apartment is_Entire bungalow
1 10945 33004 95% t 41.911960 -87.639810 Entire home/apt 4 2.0 2.0 ... 8500.159874 6272.529685 125 184.384615 26 1 0 0 0 0
458 22886177 2658212 88% t 41.942980 -87.647620 Entire home/apt 4 2.0 2.0 ... 7635.563022 7150.458225 39 115.756757 37 1 0 0 0 0
1152 45952602 1969829 87% t 41.855245 -87.646794 Entire home/apt 6 2.0 2.0 ... 7723.450909 7030.640056 136 163.4 5 1 0 0 0 0
1794 590591161012149829 9923391 98% t 41.921867 -87.653091 Entire home/apt 4 1.0 2.0 ... 7024.582473 7747.925388 22 227.727273 11 1 0 0 0 0
1243 48466374 51946578 100% f 41.918230 -87.736340 Entire home/apt 6 3.0 3.0 ... 2240.184813 16991.262821 232 101.666667 3 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1302 49230136 166918192 90% t 41.935600 -87.642410 Entire home/apt 2 1.0 1.0 ... 8213.514279 6570.674704 20 103.5 26 1 0 0 0 0
1215 47875647 148973907 89% t 41.952570 -87.677560 Entire home/apt 2 1.0 1.0 ... 4312.968446 10471.537515 13 147.0 9 1 0 0 0 0
1484 51741865 233682638 100% t 41.946620 -87.659110 Entire home/apt 5 2.0 3.0 ... 6359.608544 8424.193052 34 205.962963 27 1 0 0 0 0
310 17399529 33127842 58% t 41.885240 -87.619110 Entire home/apt 6 2.0 3.0 ... 10800.847254 3969.864512 104 200.575758 33 0 0 0 1 0
1435 51171320 21451629 63% t 41.899930 -87.670470 Entire home/apt 6 3.0 3.0 ... 5090.484721 9669.933358 198 178.285714 28 0 1 0 0 0

1940 rows × 44 columns

In [106]:
fix_col = {
    'is_Entire rental unit': 'd1',
    'is_Entire townhouse': 'd2',
    'is_Entire guesthouse': 'd3',
    'is_Entire serviced apartment': 'd4',
    'is_Entire bungalow': 'd5',
}
train_df = train_df.rename(columns=fix_col)
In [107]:
train_df
Out[107]:
id host_id host_acceptance_rate host_is_superhost latitude longitude room_type accommodates bedrooms beds ... distance_to_garfield_conservatory distance_to_science_museum cluster_id mean_price number_of_listings d1 d2 d3 d4 d5
1 10945 33004 95% t 41.911960 -87.639810 Entire home/apt 4 2.0 2.0 ... 8500.159874 6272.529685 125 184.384615 26 1 0 0 0 0
458 22886177 2658212 88% t 41.942980 -87.647620 Entire home/apt 4 2.0 2.0 ... 7635.563022 7150.458225 39 115.756757 37 1 0 0 0 0
1152 45952602 1969829 87% t 41.855245 -87.646794 Entire home/apt 6 2.0 2.0 ... 7723.450909 7030.640056 136 163.4 5 1 0 0 0 0
1794 590591161012149829 9923391 98% t 41.921867 -87.653091 Entire home/apt 4 1.0 2.0 ... 7024.582473 7747.925388 22 227.727273 11 1 0 0 0 0
1243 48466374 51946578 100% f 41.918230 -87.736340 Entire home/apt 6 3.0 3.0 ... 2240.184813 16991.262821 232 101.666667 3 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1302 49230136 166918192 90% t 41.935600 -87.642410 Entire home/apt 2 1.0 1.0 ... 8213.514279 6570.674704 20 103.5 26 1 0 0 0 0
1215 47875647 148973907 89% t 41.952570 -87.677560 Entire home/apt 2 1.0 1.0 ... 4312.968446 10471.537515 13 147.0 9 1 0 0 0 0
1484 51741865 233682638 100% t 41.946620 -87.659110 Entire home/apt 5 2.0 3.0 ... 6359.608544 8424.193052 34 205.962963 27 1 0 0 0 0
310 17399529 33127842 58% t 41.885240 -87.619110 Entire home/apt 6 2.0 3.0 ... 10800.847254 3969.864512 104 200.575758 33 0 0 0 1 0
1435 51171320 21451629 63% t 41.899930 -87.670470 Entire home/apt 6 3.0 3.0 ... 5090.484721 9669.933358 198 178.285714 28 0 1 0 0 0

1940 rows × 44 columns

In [108]:
fix_col = {
    'is_Entire rental unit': 'd1',
    'is_Entire townhouse': 'd2',
    'is_Entire guesthouse': 'd3',
    'is_Entire serviced apartment': 'd4',
    'is_Entire bungalow': 'd5',
}

# Rename the columns in 'train_df'
test_df = test_df.rename(columns=fix_col)
In [109]:
# Note that this will only work if your variables are named the same as mine. Adapt this to your setting.

# Note: the 'dist_loc*' variables correspond to the variables that contain the distance from listings to the touristic locations 1, 2, etc.

label_train = train_df['price2']
features_train = train_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
       'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price','d1', 'd2', 'd3',
       'd4', 'd5','distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum']]

label_test = test_df['price2']
features_test = test_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
       'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price', 'd1', 'd2', 'd3',
       'd4', 'd5','distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum']]

9.3.1. Training the models (Regression)¶

In [110]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
In [111]:
dep_var = train_df['price2'].copy() # This is normally called "labels"
indep_var = train_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
       'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price','d1', 'd2', 'd3',
       'd4', 'd5','distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum']]

dep_var_test = test_df['price2'].copy()
indep_var_test = test_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
       'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price', 'd1', 'd2', 'd3',
       'd4', 'd5','distance_to_the_bean',
       'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
       'distance_to_garfield_conservatory', 'distance_to_science_museum']]
In [112]:
indep_var = indep_var.dropna()
In [113]:
dep_var = dep_var.dropna()
In [114]:
dep_var = dep_var.loc[indep_var.index]
In [115]:
indep_var_test = indep_var_test.dropna()
dep_var_test = dep_var_test.dropna()
In [116]:
ridge_reg = Ridge(alpha=0.3)
ridge_reg.fit(indep_var, dep_var)
Out[116]:
Ridge(alpha=0.3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=0.3)
In [117]:
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
lasso_reg_cv = Lasso(alpha=1)
In [118]:
lasso_scores = cross_val_score(lasso_reg_cv, indep_var, dep_var, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)
In [119]:
# Take the absolute value of the scores
lasso_scores = np.abs(lasso_scores)
print('Mean MSE: %.3f (%.3f)' % (np.mean(lasso_scores), np.std(lasso_scores)))
Mean MSE: 3733.334 (440.457)
In [120]:
lasso_reg_cv2 = Lasso(alpha=0.0005)
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
lasso_scores2 = cross_val_score(lasso_reg_cv2, indep_var, dep_var, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)
lasso_scores2 = np.abs(lasso_scores2)
print('Mean MSE: %.3f (%.3f)' % (np.mean(lasso_scores2), np.std(lasso_scores2)))
Mean MSE: 3711.178 (413.325)
In [121]:
from sklearn.model_selection import GridSearchCV
In [122]:
# Note that we do not include alpha when declaring the model!
lasso_reg_cv3 = Lasso() 

# This one is the same as before so we don't really need to re-define it.
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Define grid
grid = dict()
grid['alpha'] = np.arange(0.001, 1, 0.01)

# Define the search
search = GridSearchCV(lasso_reg_cv3, grid, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)

# Perform the search
results = search.fit(indep_var,  dep_var)
# summarize
print('MSE: %.3f' % np.abs(results.best_score_))
print('Config: %s' % results.best_params_)
MSE: 3710.192
Config: {'alpha': 0.17099999999999999}
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.803e+04, tolerance: 1.275e+03
  model = cd_fast.enet_coordinate_descent(
In [123]:
# Note that we do not include alpha when declaring the model!
ridge_reg_cv = Ridge() 

# This one is the same as before so we don't really need to re-define it.
ridge_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Define grid
grid = dict()
grid['alpha'] = np.arange(0.001, 1, 0.01)

# Define the search
search = GridSearchCV(ridge_reg_cv, grid, scoring='neg_mean_squared_error', cv=ridge_cv, n_jobs=-1)

# Perform the search
resultsR = search.fit(indep_var,  dep_var)
# summarize
print('MSE: %.3f' % np.abs(resultsR.best_score_))
print('Config: %s' % resultsR.best_params_)
MSE: 3710.779
Config: {'alpha': 0.9909999999999999}
In [124]:
# Define and train a default Lasso regression model
lasso_reg = Lasso()
lasso_reg.fit(indep_var, dep_var)
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.852e+03, tolerance: 1.275e+03
  model = cd_fast.enet_coordinate_descent(
Out[124]:
Lasso()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso()
In [125]:
lasso_prices = lasso_reg.predict(indep_var_test)
In [126]:
# Define and train a default Decision Tree regression model
dtree_reg = DecisionTreeRegressor()
dtree_reg.fit(indep_var, dep_var)
Out[126]:
DecisionTreeRegressor()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor()
In [127]:
dtree_prices = dtree_reg.predict(indep_var_test)
In [128]:
fig, axs = plt.subplots(figsize=(15,6))
sns.scatterplot(x = dep_var, y = dtree_reg.predict(indep_var), label = r'Decision tree', alpha = 0.3)
plt.show()
In [129]:
# Define and train a default Random Forests regression model
random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(indep_var, dep_var)
Out[129]:
RandomForestRegressor()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor()
In [130]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
In [131]:
# Use cross validation for the four models you estimated above.
# Ridge Regression
ridge_reg = Ridge()
ridge_scores = cross_val_score(ridge_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
ridge_rmse_scores = np.sqrt(-ridge_scores)
print("Ridge Regression RMSE:", ridge_rmse_scores.mean())

# Lasso Regression
lasso_reg = Lasso()
lasso_scores = cross_val_score(lasso_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
lasso_rmse_scores = np.sqrt(-lasso_scores)
print("Lasso Regression RMSE:", lasso_rmse_scores.mean())

# Decision Tree Regression
decision_tree_reg = DecisionTreeRegressor()
tree_scores = cross_val_score(decision_tree_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
tree_rmse_scores = np.sqrt(-tree_scores)
print("Decision Tree Regression RMSE:", tree_rmse_scores.mean())

# Random Forest Regression
random_forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(random_forest_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
forest_rmse_scores = np.sqrt(-forest_scores)
print("Random Forest Regression RMSE:", forest_rmse_scores.mean())
Ridge Regression RMSE: 60.82055418742339
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.018e+03, tolerance: 1.150e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.702e+03, tolerance: 1.144e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.954e+03, tolerance: 1.163e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.030e+03, tolerance: 1.120e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.502e+04, tolerance: 1.133e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.547e+03, tolerance: 1.140e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.193e+03, tolerance: 1.168e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.310e+03, tolerance: 1.165e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.966e+03, tolerance: 1.147e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.167e+03, tolerance: 1.144e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.039e+03, tolerance: 1.166e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.463e+03, tolerance: 1.158e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.063e+04, tolerance: 1.140e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.805e+03, tolerance: 1.140e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.234e+03, tolerance: 1.144e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.054e+04, tolerance: 1.154e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.899e+03, tolerance: 1.114e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.112e+03, tolerance: 1.158e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.546e+03, tolerance: 1.178e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.154e+04, tolerance: 1.145e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.572e+03, tolerance: 1.176e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.571e+03, tolerance: 1.129e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.334e+03, tolerance: 1.164e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.624e+03, tolerance: 1.123e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.049e+03, tolerance: 1.142e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.519e+03, tolerance: 1.137e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.981e+03, tolerance: 1.126e+03
  model = cd_fast.enet_coordinate_descent(
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.214e+03, tolerance: 1.158e+03
  model = cd_fast.enet_coordinate_descent(
Lasso Regression RMSE: 60.99349634514588
Decision Tree Regression RMSE: 79.69735363221432
Random Forest Regression RMSE: 58.26056643681418

10. Save the train and test data, as well as df6 and df10.¶

In [609]:
train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)

df6.to_csv('df6.csv', index=False)
df10.to_csv('df10.csv', index=False)
In [ ]:
 

11. Export your notebook, with all the results, as a .html file (Go to File -> Save and export notebook as -> HTML) and upload the resulting HTML file to Canvas. Do not upload your notebook (not the ipynb file but the html file)¶

In [ ]: